
Getting Started with TTS

🗣️ What is Text-to-Speech (TTS)?

Text-to-Speech (TTS) is a technology that converts written text into spoken audio. It allows computers to "speak" words out loud using synthetic or recorded voices. In SkyrimNet, TTS gives NPCs the ability to talk dynamically: not just with pre-recorded lines, but with sentences generated on the fly by the AI.

🔧 How It Works

TTS works in two steps. First, the system breaks the written sentence down into sounds (called phonemes). Then a voice model produces speech audio from those sounds. SkyrimNet uses modern neural TTS systems (such as XTTS or Zonos) to make voices sound natural and emotional, as if a real person were speaking. Voices can be customized to sound robotic, dramatic, or calm, or even to imitate specific characters.
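To make the two steps concrete, here is a toy sketch in Python. It is not how XTTS, Zonos, or Piper work internally: the lookup table `TOY_LEXICON` and both functions are hypothetical stand-ins, and the "voice model" simply emits a short tone per phoneme. It only illustrates the shape of the pipeline, text to phonemes to audio samples.

```python
import math
import struct

# Hypothetical toy pronunciation table. Real engines use learned models,
# not a hand-written lookup.
TOY_LEXICON = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}

def text_to_phonemes(text: str) -> list[str]:
    """Step 1: break a sentence into phoneme symbols using the toy lexicon."""
    phonemes = []
    for word in text.lower().split():
        word = word.strip(".,!?")
        # Words missing from the toy lexicon become a placeholder symbol.
        phonemes.extend(TOY_LEXICON.get(word, ["UNK"]))
    return phonemes

def phonemes_to_pcm(phonemes: list[str], sample_rate: int = 22050) -> bytes:
    """Step 2: stand-in 'voice model' that emits a 50 ms tone per phoneme
    as 16-bit mono PCM (little-endian)."""
    samples_per_phoneme = sample_rate // 20  # 50 ms worth of samples
    frames = bytearray()
    for index, _ in enumerate(phonemes):
        freq = 200.0 + 40.0 * index  # arbitrary pitch, varies by position
        for n in range(samples_per_phoneme):
            value = int(8000 * math.sin(2 * math.pi * freq * n / sample_rate))
            frames += struct.pack("<h", value)
    return bytes(frames)

phonemes = text_to_phonemes("Hello, world!")
audio = phonemes_to_pcm(phonemes)
print(phonemes)
print(len(audio), "bytes of PCM")
```

In a real neural engine both stages are learned models, and some systems skip explicit phonemes entirely, but the input and output of the pipeline are the same: text in, audio samples out.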

🎮 Why It Matters in SkyrimNet

With TTS, SkyrimNet can bring AI-powered NPCs to life: they speak new, personalized lines every time they interact with you. Conversations are no longer limited to pre-written dialogue. NPCs can comment on your actions, remember past events, or express emotion, all in their own voice, without requiring voice actors or modding tools like the Creation Kit.

🔊 SkyrimNet TTS Engine Comparison

SkyrimNet supports multiple TTS backends, each with different strengths in quality, speed, and customization. Here's a side-by-side comparison of Zonos, XTTS, and Piper to help you choose the best engine for your system.


⚖️ Feature Comparison

| Feature | 🧠 Zonos | 🗣️ XTTS (Default) | ⚡ Piper |
| --- | --- | --- | --- |
| Voice Quality | 🎙️ Studio-grade, cinematic | 🎧 Very high, expressive | 🔉 Good, clean, lightweight |
| Voice Cloning | ✅ Yes, identity cloning | ✅ Yes, from voice sample | ❌ No cloning |
| Emotional Control | 🟡 Planned | 🟡 Basic support (tone hints) | ❌ None |
| Accent/Language Support | ✅ Wide | ✅ Cross-lingual | 🟡 Limited |
| Speed | ⚠️ Slower (heavier inference) | ✅ Moderate (~1–2 s latency) | ⚡ Instant (~100–200 ms) |
| Local Integration | 🌐 Local HTTP endpoint | 🌐 Local HTTP endpoint | 🧩 In-process (no server) |
| Output Format | WAV / PCM | WAV / PCM | PCM (16-bit mono, 22050 Hz) |
| Best Use | Followers / high-end systems | General dialogue, dynamic LLM | Background NPCs, fast chatter |
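The table lists Piper's output as raw PCM (16-bit mono, 22050 Hz). Raw PCM has no header, so most audio players cannot open it directly. As a minimal sketch, assuming you have such a buffer in hand, Python's standard `wave` module can wrap it in a WAV container. The function name `pcm_to_wav_bytes` is illustrative only, and nothing here describes how SkyrimNet handles audio internally.

```python
import io
import wave

def pcm_to_wav_bytes(pcm: bytes, sample_rate: int = 22050) -> bytes:
    """Wrap raw 16-bit mono PCM in a WAV container and return the file bytes."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)            # mono
        wav_file.setsampwidth(2)            # 16-bit = 2 bytes per sample
        wav_file.setframerate(sample_rate)  # 22050 Hz by default
        wav_file.writeframes(pcm)
    return buffer.getvalue()

# One second of silence at 22050 Hz: 22050 samples, 2 bytes each.
silence = b"\x00\x00" * 22050
wav_bytes = pcm_to_wav_bytes(silence)

# Read the header back to confirm the parameters were stored.
with wave.open(io.BytesIO(wav_bytes), "rb") as check:
    print(check.getnchannels(), check.getsampwidth(), check.getframerate())
```

The sample rate, bit depth, and channel count must match what the engine actually produced; if they do not, playback will sound sped up, slowed down, or distorted.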

🔋 Resource Usage (Approximate)

| Engine | CPU Usage | VRAM Usage | Load Time | Notes |
| --- | --- | --- | --- | --- |
| Zonos | 🔥 High | 🔥 High (approx. 6 GB) | 🕒 ~1–3 s | Large models, best for key scenes |
| XTTS | ⚠️ Moderate | ⚠️ Moderate (approx. 3 GB) | 🕒 ~1–2 s | Real-time feasible, very flexible |
| Piper | ✅ Low | ✅ None (CPU only) | ⚡ Instant | Fastest, most efficient TTS |

🧪 Note: Resource usage depends on the hardware and the specific model used. GPU acceleration significantly improves both Zonos and XTTS; on high-end systems their load times can be near-instant.


🎯 Summary

| Engine | Strengths | Tradeoffs |
| --- | --- | --- |
| Zonos | Cinematic quality, voice cloning, emotional nuance | Slower, heavier; ideal for premium content |
| XTTS | Great balance of quality, cloning, and speed | Slight delay; occasional voice drift |
| Piper | Extremely fast and lightweight for real-time interaction | No cloning or advanced voice features |

🛠️ Choosing the Right Engine

| Scenario | Recommended Engine |
| --- | --- |
| Voiced main quest with drama/emotion | 🎙️ Zonos |
| Companion with a personalized voice | 🗣️ XTTS |
| Fast ambient barks / guards / vendors | ⚡ Piper |
| Fully dynamic AI-driven conversations | 🗣️ XTTS |
| Low-end PC / limited available VRAM | ⚡ Piper |
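The recommendations can be condensed into a rough rule of thumb. The sketch below is a hypothetical helper, not part of SkyrimNet: the function `pick_engine`, its parameters, and the thresholds are assumptions based on the approximate VRAM figures quoted in the resource table (about 6 GB for Zonos, about 3 GB for XTTS, none for Piper).

```python
# Approximate VRAM needs, taken from the resource usage table (not measured).
ZONOS_VRAM_GB = 6.0
XTTS_VRAM_GB = 3.0

def pick_engine(free_vram_gb: float, premium_scene: bool = False) -> str:
    """Suggest a TTS engine from free VRAM and whether the content is
    'premium' (e.g. a dramatic main-quest scene). Hypothetical helper."""
    if premium_scene and free_vram_gb >= ZONOS_VRAM_GB:
        return "Zonos"   # heaviest engine, highest quality
    if free_vram_gb >= XTTS_VRAM_GB:
        return "XTTS"    # default, balanced choice
    return "Piper"       # CPU-only fallback for low-VRAM systems

print(pick_engine(8.0, premium_scene=True))
print(pick_engine(4.0, premium_scene=True))
print(pick_engine(2.0))
```

Treat the thresholds as starting points only: actual memory use depends on the model, and the game itself competes for the same VRAM.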

TL;DR

  • Zonos = Premium, cinematic, cloned voices with deep expression
  • XTTS = Default engine with cloning and great all-around quality
  • Piper = Fastest engine, perfect for lightweight real-time voice playback

Eventually, all three engines should be mixable per actor or event within SkyrimNet for optimal performance and immersion. (Note: this is not yet supported as of beta4.)